ABSTRACT

We  survey  a  new  way  to  get  quick  estimates  of  the  values  of  simple 
statistics  (like  count,  mean,  standard  deviation,  maximum,  median,  and  mode 
frequency)  on  a  large  data  set.  This  approach  is  a  comprehensive  attempt 
(apparently  the  first)  to  estimate  statistics  without  any  sampling,  by  reasoning 
about  various  sets  containing  a  population  of  interest.  Our  "antisampling" 
techniques  have  connections  to  those  of  sampling  (and  have  duals  in  many  cases), 
but  they  have  different  advantages  and  disadvantages,  making  antisampling 
sometimes  preferable  to  sampling,  sometimes  not.  In  particular,  they  can  only  be 
efficient  when  data  is  in  a  computer,  and  they  exploit  computer  science  ideas 
such  as  production  systems  and  database  theory.  Antisampling  also  requires  the 
overhead  of  construction  of  an  auxiliary  structure,  a  "database  abstract".  Tests 
on  sample  data  show  similar  or  better  performance  than  simple  random  sampling. 
We  also  discuss  more  complex  methods  of  sampling  and  their  disadvantages. 
1.  Introduction


We are developing a new approach to estimation of statistics. This technique, called
"antisampling", is fundamentally different from known techniques in that it does not
involve sampling in any form. Rather, it is a sort of inverse of sampling.

Consider  some  finite  data  population  P  that  we  wish  to  study  (see  figure  1).  Suppose 
that  P  is  large,  and  it  is  too  much  work  to  calculate  many  statistics  on  it,  even  with  a 
computer.  For  instance,  P  might  be  one  million  records,  too  big  for  the  main 
memories  of  most  personal  computers.  We  would  have  to  store  it  on  disk,  requiring 
minutes  to  transfer  to  main  memory  for  a  calculation  of  a  single  mean.  If  we  are  in  a 
hurry, or if we are doing exploratory data analysis and are just interested in a rough
estimate of the statistic, this is too long. So we could create a sample S of P, a
significantly smaller selection of items from P, and calculate statistics on S rather than
on P, extrapolating the results to P.

But there is another way we could estimate statistics on P, by coming from the other
direction. We could take some larger set known to contain P (call it A for
"antisample") and calculate statistics on it, then extrapolate down to P.
Downwards inference might be preferable to upwards inference from a sample,
because an antisample can contain more information than a sample because it is
bigger. For instance, sample S may be missing rare but important data items in
population P that are in antisample A.

But there seems to be a big problem with antisampling: antisample A must be larger
than population P, and it would seem more work to calculate statistics on A than on P.
But not necessarily. An important principle of economics is that cost can be
amortised, distributed across many uses. Just as the cost of development of a
package of statistical routines can be distributed over many purchasers, the work of
calculating statistics on an antisample A can be charged to many uses of those
statistics. We can do this if we choose an interesting antisample that people often ask
questions about. Of course, we don't have to confine ourselves to one antisample; we
can have a representative set of them, a "database abstract" for a particular universe
of data populations or database. There are many excellent situations for
amortisation. For instance, U.S. Census aggregate statistics on population and income
are used by many different researchers for many different purposes, and laws require
periodic publication of this information anyway.

Two caveats regarding these techniques are necessary, however. First, a form of the
"closed world assumption" important in database research [11] is necessary: we can
only make inferences about items within the antisample, and not any larger
population. This means that if the populations themselves are samples (a "hybrid"
approach of antisampling and sampling) we cannot make inferences about the larger
populations from which those samples are drawn, which makes the approach not very
useful. Second, as with sampling, estimates are approximate. Since antisampling and
sampling are rather different methods of estimation, sometimes antisampling is better
than sampling, but sometimes it is worse. Generally antisampling is only a good idea
when one or more of three conditions hold: (1) users are doing exploratory data
analysis, the initial stages of statistical study; (2) users are statistically naive; or (3)
data is predominantly kept on secondary (magnetic or optical disk) or tertiary
(magnetic tape) storage.


*  This terminology is suggested by the matter-antimatter dichotomy in physics. Antisampling
is a sort of opposite to sampling, using opposites of sampling techniques.


2.  The analogy of antisampling to sampling


2.1.  Areas of correspondence of antisampling and sampling

Analogs of nearly all the same techniques can be used with antisampling as with
sampling. For instance if the sum statistic on a sample S is T, then the extrapolation
rule for the sum statistic inferred on the population P is T times the ratio of the size of
P to the size of S. Similarly, if a sum statistic on an antisample is T the extrapolated
estimate of the statistic on P is T times the ratio of the size of P to the size of A. For
another thing, both sampling and antisampling can combine estimates based on
multiple samples and multiple antisamples using various methods. With antisampling
this is perhaps more common since a study population can be specified as the
intersection of several antisample "parent" sets.
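The two extrapolation rules for the sum statistic can be sketched as follows. This is only an illustration, assuming the population behaves as if it were random with respect to the sample or antisample; the function name and the figures are hypothetical and not taken from the implementation described in this paper.

```python
def extrapolate_sum(known_sum, known_size, population_size):
    """Scale a sum known on a sample S or an antisample A to the population P.

    The rule is the same in both directions: multiply the known sum by
    size(P) / size(known set). The ratio is above 1 for a sample and at
    most 1 for an antisample.
    """
    if known_size <= 0:
        raise ValueError("the known set must be non-empty")
    return known_sum * population_size / known_size


# Upward inference: a sample of 100 items summing to 2500, population of 1000.
from_sample = extrapolate_sum(2500, 100, 1000)
# Downward inference: an antisample of 5000 items summing to 120000, same population.
from_antisample = extrapolate_sum(120000, 5000, 1000)
```

The only difference between the two directions is whether the scaling ratio is above or below one.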

Antisampling is in fact a more natural direction for making inferences, since the
inference rules (equivalently, "estimation methods") used with sampling are often
derived from assuming a class of distributions representing the data population and
reasoning what characteristics of the sample would be, then inverting this and
reasoning backwards. So antisampling rules are derived, then inverted to get
sampling rules. Thus we expect less uncertainty associated with antisampling than
sampling. Note also another reason: sampling requires assumptions of the form of the
distribution from which the sample is drawn, while antisampling does not use such
information. But there is a concomitant disadvantage of antisampling: the population
about which inferences are drawn will not usually be random with respect to the
antisamples. We can assume it is random to get "reasonable-guess" estimates of
statistics, but this will get us into trouble when different attributes of the data are
strongly correlated and the query mentions the correlated attributes. Another
approach is to store many correlation (linear or nonlinear) statistics about an
antisample so that the randomness of a population within an antisample may be
estimated. These complexities have a partial compensation in bounding capability, a
special property of antisampling not shared by sampling, discussed in the next section.

One important aspect of antisampling deserves emphasis, however. Unlike sampling,
antisampling is knowledge-intensive: it requires construction of a special auxiliary
structure, the database abstract. This makes antisampling systems like the expert
systems of artificial intelligence [•], requiring for construction careful cooperation of
experts in the domain of the data. This is because the choice of just what data to put
in the database abstract is important. One could just parameterise the distribution of
each attribute, then parameterise the two-dimensional distribution for each pair of
attributes, and so on, as [•] does, but this exhaustive approach fails to take advantage
of many redundancies between the various distributions involved. After all, there are
an infinite number of possible statistics, subsets, and attributes (including derived ones)
on a finite database, and even with strong complexity limits on queries the
combinatorial possibilities can be immense. Correlations between attributes can be
quantified as statistics too, by regression coefficients. Expertise with the data is thus
required to advise what statistics best summarise it. This must be traded off with the
frequency that users ask particular sorts of queries (perhaps weighted by utilities of
query answers to them). Both normative (e.g. mean) and extremum (e.g. maximum)
statistics are desirable for the abstract, to characterise both the common and the
uncommon in the data, since users will want to ask about both. Important sets of
related items formed by partitioning the values of each attribute should be the sets on
which statistics are computed (what we in [IS] call "first-order sets", and what [•]


2.2.  Absolute bounds and production systems

Antisampling supports a different kind of inference virtually impossible with sampling:
reasoning about absolute bounds on statistics. Suppose we know the maximum and
minimum of some attribute of an antisample A. Then since P must be contained
entirely within A, any maximum, minimum, mean, median, or mode of P is bounded
above and below by the maximum and minimum on A. But you can't do this the
other way around: given the maximum and minimum of a sample S, you have no idea
what the largest possible value or smallest possible value on the population P is for the
maximum, minimum, mean, median, or mode on P. With particular assumptions
about P and S you can put confidence limits on statistics of P: say if you assume
that S is a random sample drawn from P, and that P doesn't contain any extreme
outliers, the mean of S will tend to be close to the mean of P, with a certain standard
deviation. But assumptions like these, common in statistics, are messy and
uncomfortable for computer scientists. There is a qualitative difference between being
sure and being completely sure. If one can obtain a tight absolute bound, it
should be preferable to an estimate with a confidence interval.

But a serious objection may be raised to absolute bounds as opposed to estimates and
confidence intervals: they can sometimes be very weak because they must account for
a few highly extreme but possible cases. There are four answers to this. First, many
uses of statistics do not require high degrees of accuracy. If one is doing exploratory
data analysis, the statistic may just be used to get an idea of the order of
magnitude of some phenomenon in the database, and absolute bounds within an order
of magnitude are quite satisfactory [10]. Also, there are situations where statistics are
used for comparison, and the only question is whether the statistic is greater than or
less than a value, as in choosing the best way to process a database retrieval from one
of several equivalent methods based on estimated sizes of the sets involved [4],
[...] that [...] obtain [...]

[...] bounds often are easier to calculate than estimates. The usual need
[...] assumptions means many more parameters in estimating than
[...] good demonstration is in section 4.1: estimates lead to nonlinear
[...] exponentials and no closed-form solution, while bounds lead to
[...] that can be handled with standard parametric optimisation methods to
[...] expressions. The easier computability of bounds has long been
[...] computer science, as in the theory of algorithms where worst-case
[...] the O notation is more common than the complexities of probabilistic
[...] bounds can be made tighter with associated assumptions of reasonable
[...] unspecified statistics. For example, Chebyshev's inequality says that
[...] a fraction σ²/D² of items can lie more than D from the mean of a
[...] But if the distribution has a single mode close to the mean, the Camp-Meidell
[...]ally gives results about twice as good. Other inequalities cover other
[...]

[...] weak absolute bounds on the value of statistics can
[...] insight in the field of artificial intelligence: many small
[...] can combine to give strong information. And with
[...] the combining is easy: just take the minimum of the
[...] sum of the lower bounds, to get cumulative upper and
[...] or independence assumptions are required. Often
[...] can lead to different bounds on the same quantity,
[...] to combine all these different methods into a single
formula. Section 4 gives some examples.
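As an illustration of how several weak absolute bounds on the same quantity can be combined, here is a minimal sketch. It assumes the combining rule is plain interval intersection (the smallest of the upper bounds and the largest of the lower bounds); the function name and the figures are hypothetical.

```python
def combine_bounds(bounds):
    """Intersect several (lower, upper) absolute bounds on one quantity.

    Every input pair is assumed to be a guaranteed bound, so the true value
    lies in all of the intervals and therefore in their intersection.
    """
    bounds = list(bounds)
    if not bounds:
        raise ValueError("at least one bound is required")
    lower = max(low for low, _ in bounds)
    upper = min(high for _, high in bounds)
    if lower > upper:
        raise ValueError("the bounds are mutually inconsistent")
    return lower, upper


# Three rules, each weak on its own, bound the same count.
combined = combine_bounds([(3, 80), (0, 63), (1, 70)])
```

No distributional or independence assumption enters the calculation, which is what distinguishes it from combining estimates.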


Expressing reasoning methods as a number of small, isolated pieces of information is
the idea behind the artificial-intelligence concept of a "production system" [2], a
programming architecture that has been applied to many interesting problems. It is
the opposite extreme to the notion of a computer as a sequential processor, as for
instance in an optimisation program that uses a single global measure to guide search
for a solution to a complicated problem. In a production system there is no such
global metric, only pieces of separate knowledge about the problem called "production
rules", all competing to apply themselves to a problem. Production systems are good
at modeling complex situations where there are many special cases but no good theory
for accomplishing things. Thus reasoning about absolute bounds given statistics on
antisamples seems a natural application for production systems. It has some
similarities to symbolic algebraic manipulation, which often uses this sort of
architecture [19]. We can use a number of more sophisticated techniques developed in
artificial intelligence to avoid redundant computation in a production system, as for
instance relaxation methods or "constraint propagation" [3]. We can also write
estimation methods as rules, and combine both estimates and bounds into a

3.  A short demonstration

To show a little of what this approach can accomplish, we show some behavior for a
partial Interlisp implementation, as of February IMS. (We have done work since then
on a more complete Prolog implementation, but have not put it together.) The
database abstract includes simple statistics on all first-order (single-word-name) sets,
including statistics on each ship nationality, ship type, and major geographical region.
No correlations between attributes are exploited. "Guess" is the estimate;
"guess-error" the standard deviation associated with that estimate; "upper-limit" and
"lower-limit" are the absolute bounds on the answer. The "actual answer" is found by
going afterwards to the actual data and computing the exact value of the statistic.
The system does not understand English; we have just paraphrased our formal query
language to make it easier to read. For more details and demonstrations see [IS].
(GUESS:  12.6  GUESS-ERROR:  3.3  UPPER-LIMIT:  63
LOWER-LIMIT:  3) 

(ACTUAL  ANSWER  IS  14) 


What's the frequency of the most common tanker class
among the French, Italian, American, and British?
(GUESS:  18.5  GUESS-ERROR:  2.2  UPPER-LIMIT:  25 
LOWER-LIMIT:  15) 

(ACTUAL  ANSWER  IS  18) 

What’s  the  mean  longitude  for  Liberian  ships  of  type  ALI 
not  in  the  Mediterranean? 

(GUESS:  49.0  GUESS-ERROR:  42.4  UPPER-LIMIT:  176 
LOWER-LIMIT:  6) 

(ACTUAL  ANSWER  IS  44.75) 

What's the mean distance of ALI-type ships from 30N5W?
(GUESS:  51.0  GUESS-ERROR:  12.3  UPPER-LIMIT:  57.1 
LOWER-LIMIT:  6.0) 

(ACTUAL  ANSWER  IS  42.34673) 

What's the most common registration-nationality region
for  type  ALI  ships  currently  in  the  Mediterranean? 
(GUESS:  46.6  GUESS-ERROR:  9.3  UPPER-LIMIT:  78 
LOWER-LIMIT:  26) 

(ACTUAL  ANSWER  IS  37) 


4.  Three  examples 

In this short paper it is impossible to describe the varied categories of inference rules
on antisample statistics that we have studied. [IS] and [18] provide overviews, and
the latter provides additional details and many examples. But for illustration we
present three important categories.


4.1.  Bounding the size of set intersections

Set intersections (or equivalently, conjunctions of restrictions on a query set) are very
common in user queries to databases. Efficient processing requires good methods for
estimating their counts or sizes in advance.

If we know the sizes of the sets being intersected, then an upper bound on the size of
the intersection is the minimum of the set sizes. A lower bound is the sum of the set
sizes minus the product of the size of the database and one less than the number of sets
being intersected, or zero if this is negative.
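A minimal sketch of these two size-only bounds, with the lower bound written as the sum of the set sizes minus (number of sets - 1) times the database size, floored at zero. The function name and figures are illustrative.

```python
def simple_intersection_bounds(set_sizes, database_size):
    """Absolute (lower, upper) bounds on the size of an intersection of sets.

    Only the sizes of the sets and of the enclosing database are used.
    """
    k = len(set_sizes)
    if k == 0:
        raise ValueError("at least one set is required")
    upper = min(set_sizes)
    # Each set omits database_size - size items; in the worst case the
    # omitted items are all different, which gives this lower bound.
    lower = max(0, sum(set_sizes) - database_size * (k - 1))
    return lower, upper


# Three sets of 700, 800 and 900 items inside a database of 1000 items.
bounds = simple_intersection_bounds([700, 800, 900], 1000)
```

With small sets the lower bound collapses to zero, which is why the additional statistics discussed next are worth storing.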

We can do better if we have more statistics on the antisamples. If we know the mode
frequencies and number of distinct values on some attribute, then an upper bound is
the product of the minimum mode frequency over all sets with the minimum number
of distinct values of a set over all sets. Sometimes this bound will be better than the
upper bound in the last paragraph, and sometimes not. We can see that if the two
minima occur for the same set, the bound will be at least the size of that set, since
the product of a mode frequency and number of distinct values for a single set must be
at least the size of the set. On the other hand, consider two sets of sizes 1000 and
9000, with mode frequencies on some attribute 100 and 500 respectively, and with
numbers of distinct values 50 and 5 respectively. Then the simple bound of the last
paragraph is 1000, but the frequency-information bound is min(100,500) * min(50,5)
= 500 which is better (smaller). So both approaches are needed.
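The frequency-information bound and its combination with the simple size bound can be sketched as below. The function names are hypothetical, and the set sizes in the example are illustrative figures chosen to be consistent with the mode frequencies and distinct-value counts.

```python
def mode_frequency_upper_bound(mode_frequencies, distinct_counts):
    """Upper bound on an intersection size from per-set mode statistics.

    The intersection has at most min(distinct_counts) distinct values, and
    no value can occur more often than min(mode_frequencies) times.
    """
    return min(mode_frequencies) * min(distinct_counts)


def best_upper_bound(set_sizes, mode_frequencies, distinct_counts):
    """Take whichever of the size bound and the frequency bound is smaller."""
    return min(min(set_sizes),
               mode_frequency_upper_bound(mode_frequencies, distinct_counts))


frequency_bound = mode_frequency_upper_bound([100, 500], [50, 5])
overall_bound = best_upper_bound([1000, 2000], [100, 500], [50, 5])
```

Taking the minimum of the two bounds costs nothing, since each is valid on its own.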


We can generalise this method to cases where we know more detailed information of
the frequency distributions of the sets. We just superimpose the frequency
distributions and take the minimum of the superimposed frequencies for each value.
See figure 3.
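Superimposing frequency distributions can be sketched as follows, assuming each set's distribution on the attribute is available as a mapping from value to frequency. The representation, the function name and the ship-type figures are assumptions made for illustration.

```python
def histogram_upper_bound(histograms):
    """Upper bound on an intersection size from per-set frequency tables.

    For every attribute value, the intersection can contain no more items
    with that value than the set having the fewest such items.
    """
    if not histograms:
        raise ValueError("at least one histogram is required")
    values = set().union(*histograms)  # iterating a dict yields its keys
    return sum(min(h.get(v, 0) for h in histograms) for v in values)


set_a = {"tanker": 30, "freighter": 50, "tug": 20}
set_b = {"tanker": 45, "freighter": 10}
bound = histogram_upper_bound([set_a, set_b])
```

Here the bound of 40 improves on the size-only bound of 55 (the size of the smaller set).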

If instead of (or in addition to) frequency information we have maxima and minima on
some attribute, we may be able to derive bounds by another method. An upper
bound on the maximum of a set intersection is the smallest of the maxima on each set,
and a lower bound is the largest of the minima on each set. See Figure 3. Hence an
upper bound on the size of an intersection is the number of items in the entire
database having values between that cumulative maximum and minimum. If the
maxima are all identical and the minima are all identical, then the cumulative
maximum and minimum are the same as on any of the sets being intersected, so the
simple (set-size) bound will always be better. But the maximum-minimum bound can
be an excellent one whenever two or more of the sets being intersected have very
different ranges, as when we are intersecting two sets with ranges 100 to 600 and 450
to 750 respectively, and the cumulative range is 450 to 600, and there are few items in
the database with those particular values; we can then impose an upper bound on
the intersection size.
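A sketch of the maximum-minimum bound follows. It assumes, purely for illustration, that the count of database items in a value range can be obtained from a sorted list of the attribute values; in practice a stored histogram of the database would play this role. All names and figures are hypothetical.

```python
from bisect import bisect_left, bisect_right


def range_upper_bound(ranges, sorted_database_values):
    """Upper bound on an intersection size from per-set (min, max) ranges.

    Items of the intersection must lie between the largest minimum and the
    smallest maximum, so they cannot outnumber the database items there.
    """
    low = max(lo for lo, _ in ranges)
    high = min(hi for _, hi in ranges)
    if low > high:
        return 0  # the ranges do not overlap, so the intersection is empty
    return (bisect_right(sorted_database_values, high)
            - bisect_left(sorted_database_values, low))


database_values = [120, 300, 455, 500, 610, 700, 740]
bound = range_upper_bound([(100, 600), (450, 750)], database_values)
```

Only the values 455 and 500 fall in the cumulative range 450 to 600, so the bound is 2.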

We can also use sums (or equivalently, means) on attributes. Suppose (1) we wish to
estimate the size of the intersection of only two sets; (2) one set is a partition of the
database for the values of some numeric attribute; (3) we know the values this
attribute can have; and (4) we know the size and mean of both sets. Then we can write
two linear Diophantine (integer-solution) equations with the number of items having
each possible value of the attribute being the unknowns, and solve for a finite set of
possibilities. We can then take the minima of the pairs of the maximum possible
values for each value, and sum to get an upper bound on the size of the intersection.
Diophantine equations tend to support powerful inferences, since the integer-solution
constraint is a very strong one. There turn out to be many related phenomena that
can give additional constraints on the variables, making inferences even better. See [...].
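The Diophantine reasoning can be illustrated by brute-force enumeration on a tiny example. This is only a sketch under stated assumptions: the attribute takes a small known set of non-negative integer values, each set's size and sum (size times mean) are known, and exhaustive search stands in for a proper Diophantine solver. All function names are hypothetical.

```python
def count_vectors(values, size, total):
    """Yield every tuple of non-negative counts, one per attribute value,
    whose counts add up to `size` and whose weighted sum equals `total`."""
    if not values:
        if size == 0 and total == 0:
            yield ()
        return
    first, rest = values[0], values[1:]
    for count in range(size + 1):
        for tail in count_vectors(rest, size - count, total - first * count):
            yield (count,) + tail


def max_counts(values, size, total):
    """Largest feasible count of each value over all integer solutions."""
    solutions = list(count_vectors(values, size, total))
    if not solutions:
        raise ValueError("size and sum admit no integer solution")
    return [max(column) for column in zip(*solutions)]


def diophantine_upper_bound(values, size_a, sum_a, size_b, sum_b):
    """Sum over values of the smaller of the two maximum possible counts."""
    max_a = max_counts(values, size_a, sum_a)
    max_b = max_counts(values, size_b, sum_b)
    return sum(min(a, b) for a, b in zip(max_a, max_b))


# Attribute values 1, 2, 3; set A has 4 items summing to 5, set B has 3 summing to 8.
bound = diophantine_upper_bound((1, 2, 3), 4, 5, 3, 8)
```

In this example each set has a single feasible count vector, (3, 1, 0) and (0, 1, 2), so the bound is 1 where the size-only bound would be 3.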


Several other kinds [...] can bound the size of set intersections as discussed in
[1•].


4.2.  [...] of monotonically [...] values


Suppose we know the means and standard deviations of some antisamples. Suppose
we are interested in the logarithms of the original data values. (Sometimes different
transformations on the same data values are all useful, or sometimes we may not be
sure when we create the antisamples what the best transformation is, or sometimes
different ranges of the data values require different transformations for best analysis.)
And suppose we are interested in knowing the mean of the transformed data values.
[16] examines this problem in detail; we summarise it here.

A variety of classical techniques has been applied to this problem. For instance, you
can approximate the logarithm curve by a three-term Taylor-series approximation at
the mean, giving as an estimate of the mean of the logarithms log(μ) - σ²/(2μ²). But it
is hard to obtain confidence intervals on this result to quantify its degree of
uncertainty, though several methods have been tried [7]. This estimate is always
biased, and sometimes is an impossibility (when it gives a value unachievable with any
possible distribution consistent with the original mean and standard deviation).
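The three-term Taylor estimate comes from expanding log(x) about the mean μ as log(μ) + (x - μ)/μ - (x - μ)²/(2μ²) and averaging: the linear term vanishes and the quadratic term averages to σ²/(2μ²). A minimal sketch, with a hypothetical function name and illustrative figures:

```python
import math


def taylor_mean_log(mean, std):
    """Three-term Taylor-series estimate of the mean of log(x).

    Uses only the mean and standard deviation of the original values.
    This is an estimate, not a bound, and it carries no error measure.
    """
    if mean <= 0:
        raise ValueError("the mean must be positive for the logarithm")
    return math.log(mean) - std ** 2 / (2 * mean ** 2)


estimate = taylor_mean_log(15.0, 1.0)
```

For a mean of 15 and a standard deviation of 1 the correction term is 1/450, so the estimate sits slightly below log(15).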

Rule-based inferences about bounds provide an appealing alternative. Several simple
methods  bound  the  mean  of  the  logarithms,  no  one  the  best  for  all  situations.  We  can 
try  them  all,  separately  for  upper  and  lower  bounds,  and  combine  results. 

1.  Linear  approximation  bounds.  We  can  draw  lines  that  lie  entirely  above  or  entirely 
below the function we are approximating on an interval. For many curves, the best
upper  bound  line  is  found  by  taking  the  tangent  at  the  mean,  and  the  best  lower  bound  line 
is  found  by  drawing  a  secant  across  the  curve  from  the  smallest  data  value  to  the  largest 
data  value.  See  Figure  4. 

2.  Quadratic-approximation  bounds. 

A.  Taylor-series.  That is, bounds curves from first three terms of a Taylor-series
about  some  point  on  the  function. 

B.  Chebyshev-Lagrange.  That is, a quadratic Lagrange interpolating polynomial
passing  through  the  three  points  of  the  function  that  are  optimal  for  Chebyshev- 
approximation. 

C.  Special-purpose.  For  particular  functions  (e.g.  reciprocal  and  cube),  particularly 
tight  bounds  because  of  peculiarities  of  the  mathematics  of  those  functions. 

D.  Pseudo-order-statistic.  Taylor-series  approximations  improved  by  Chebyshev’s 
inequality  and  related  inequalities. 

3.  Order statistics.  If we know medians or quantiles we can break up the approximation
problem  into  subintervals  corresponding  to  each  quantile  range,  and  solve  a  subproblem  on 
each. 

4.  Optimisation.  We  can  iteratively  converge  to  optimal  bounds  for  a  class  of  bounding 
curves,  by  expressing  the  class  parametrically  and  optimising  on  the  parameters,  with 
objective  function  the  statistic  being  bounded.  This  tends  to  be  computationally  expensive 
and  not  advisable  when  estimation  speed  is  important. 

As an example, suppose we know the minimum of the set of data is 10, the maximum
is 20, the mean is 15, and the standard deviation is 1. Then the linear bounds on the
mean of the logarithm are 2.650 and 2.704; the Taylor-series bounds found by taking
the Taylor-series at the mean are 2.680 and 2.716; the Lagrange-Chebyshev bounds
are 2.701 and 2.709; the Pseudo-Order-Statistics bounds are 2.700 and 2.711; and the
best quadratic bounds found by optimisation are 2.705 and 2.708. For another
example, suppose the minimum is 1, and maximum is 200, the mean is 100, and the
standard deviation is 20. Then the linear bounds are 5.0S2 and 6.247; the
Taylor-series bounds are 1.484 and 6.242; the Lagrange-Chebyshev bounds are 2.048 and
6.400; the Pseudo-Order-Statistics bounds are 6.868 and 6.242; and the bounds found
by quadratic optimisation are 6.032 and 6.242. These bounds are surprisingly tight,
and should be adequate for many applications.
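The linear approximation bounds (method 1 above) can be sketched for the logarithm as follows. Because the logarithm is concave, the tangent at the mean lies above the curve and the secant between the smallest and largest data values lies below it. The function name and the illustrative inputs (minimum 10, maximum 20, mean 15) are assumptions of this sketch.

```python
import math


def linear_bounds_mean_log(minimum, maximum, mean):
    """Absolute (lower, upper) bounds on the mean of log(x).

    Upper bound: the tangent line at the mean, averaged, gives log(mean).
    Lower bound: the secant from minimum to maximum, evaluated at the mean.
    """
    if not 0 < minimum <= mean <= maximum:
        raise ValueError("need 0 < minimum <= mean <= maximum")
    upper = math.log(mean)
    if maximum == minimum:
        return upper, upper
    slope = (math.log(maximum) - math.log(minimum)) / (maximum - minimum)
    lower = math.log(minimum) + slope * (mean - minimum)
    return lower, upper


lower, upper = linear_bounds_mean_log(10.0, 20.0, 15.0)

# Check against one concrete data set with the same minimum, maximum and mean.
data = [10, 14, 15, 16, 20]
actual = sum(math.log(x) for x in data) / len(data)
```

The standard deviation is not needed for these linear bounds; the quadratic methods use it to tighten them.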

There is a more direct optimisation method for this problem, involving treating the
optimisation variables as the values of a distribution satisfying certain constraints and
moving the variables around until an optimum is achieved. We have experimented
with such optimisation, but it is considerably less well-behaved than the parametric
one mentioned earlier. It is tricky to get to converge properly, even in simple
situations. This optimisation also suffers from serious sensitivity to errors in
calculation. And since we can only use a small number of variables compared to the
sizes of many interesting populations, the number converged to by the optimisation
process will be only a lower bound on an upper bound, or an upper bound on a lower
bound, and these things are considerably less helpful to us than the upper bounds on
upper bounds and lower bounds on lower bounds obtained with the rule-based
inferences discussed above. This is a fundamental weakness of these "direct"
optimisation methods, and an important justification for our approach.


4.3.  Optimal  rules  relating  statistics  on  the  same  distribution 


Another category of rules relates statistics on the same attribute of the same set (as
when one estimates or bounds the mean given the median). Many of these situations
are instances of the "isoperimetric problem" of the calculus of variations ([22], ch. 4),
for which there is a general solution method. The mathematics becomes complicated
even for some rather simple problems, but the rules generated are mathematically
guaranteed to be the best possible, an important advantage.

The idea is to find a probability distribution that has an extreme value for either some
statistic or the entropy of the distribution, and then find the extreme value. Let the
probability distribution we are trying to determine be y = f(x). Suppose we have
some integral we wish to maximise or minimise:

$$\int_m^M F(x, y)\,dx$$

Suppose we have prior constraints on the problem as known statistics expressible as
integrals:

$$\int_m^M G_j(x, y)\,dx = c_j$$

where j goes from 1 to k, the total number of known statistics. As before, the limits m
and M represent the minimum and maximum on the distribution, or at worst lower
and upper bounds respectively on these quantities; these are necessary for this method
to work, and they must be the same for all integrals.


As examples of statistics expressible as integrals:

mean: $\int_m^M x\,y\,dx$

variance: $\int_m^M (x - \mu)^2\,y\,dx$

median: $\int_m^M u_{-1}(x - a)\,y\,dx$, with $u_{-1}(x)$ the unit step function

root mean square error: [...]


It was proved by Lagrange ([22], p. 51) that a necessary condition for an extremum
(either maximum or minimum) of the F integral is

$$\frac{\partial F}{\partial y} + \sum_{j=1}^{k} \lambda_j \frac{\partial G_j}{\partial y} = 0$$

If F is y*log(y), this method gives a necessary condition for the maximum-entropy
distribution. Several researchers have used this to obtain maximum-entropy estimates
of unknown moment statistics from knowledge of other moment statistics, in both the
unidimensional and multidimensional cases ([17], Appendix). For the unidimensional
case, the form of the maximum entropy distribution given moments up through the
rth is

$$y = \exp\Bigl(-1 - \sum_{j=0}^{r} \lambda_j x^j\Bigr)$$

The remaining problem is to determine the λs (Lagrange multipliers), which can be
tricky. A number of arguments in [17] justify the term "optimal" for these estimates.

F can also be a statistic itself. For instance if F is the kth moment when we know
values for all moments up through the (k-1)th, the necessary condition for a solution
becomes:

$$x^k + \sum_{j=0}^{k-1} \lambda_j x^j = 0$$

This is a kth-order polynomial, with a maximum of k solutions. Hence the probability
distribution that gives the extrema of the kth moment is a k-point discrete probability
distribution. It can be found by a symbolic optimisation process with 2k unknowns (k
values of x, and k associated probabilities) with k equality constraints in the form of
the known k-1 moments plus the knowledge that the probabilities must sum to 1.
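The step from the general necessary condition to the polynomial condition can be spelled out. The notation is assumed for this sketch: F is the integrand being extremised, G_j are the integrands of the known moments (with j = 0 standing for the requirement that the probabilities sum to 1), and λ_j are the Lagrange multipliers.

```latex
\begin{aligned}
F(x, y) &= x^{k}\, y, \qquad G_j(x, y) = x^{j}\, y \quad (j = 0, 1, \dots, k-1), \\
\frac{\partial F}{\partial y} + \sum_{j=0}^{k-1} \lambda_j \frac{\partial G_j}{\partial y}
  &= x^{k} + \sum_{j=0}^{k-1} \lambda_j x^{j} = 0 .
\end{aligned}
```

Since y has dropped out of the condition, it restricts only where the distribution may place probability: at the real roots of this polynomial, of which there are at most k. That is why the extremal distribution is discrete.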


5. Detailed comparison: antisampling vs. sampling

We  now  evaluate  the  relative  merits  of  sampling  and  antisampling.  We  assume  data 
populations  stored  in  computers  (a  condition  that  is  becoming  increasingly  common 
with  routine  administrative  data). 


5.1.  Miscellaneous  advantages  of  antisampling 

Most of our arguments concern the relative efficiency of various kinds of sampling vs.
antisampling. But first some general points:

(1) Sometimes the data is already aggregated. Much of the published U.S. Census
data is; aggregation provides privacy protection for an individual's data values.
So we must use antisampling methods in some form if we want to estimate
statistics not in the original tabulation; we have no other choice.

(2) Sampling is poor at estimating extremum statistics like maximum and mode
frequency. Extremum statistics have important applications in identifying
exceptional or problematic behavior. Antisampling handles such statistics well,
in part because it can use extremum statistics of the entire database as bounds.

(3) Updates to the database can create difficulties for samples, since the information
about what records the samples were drawn from will usually be thrown away.
For antisampling with many statistics including counts and sums, however, the
original data is not needed: the antisample statistics can be updated themselves
without additional information.
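
As a minimal sketch of point (3), assume an antisample stores only a count, a sum, and a sum of squares for one attribute. The class and method names below are hypothetical and are not taken from any database abstract implementation. Under that assumption, insertions and deletions can be folded into the stored statistics without rereading the data:

```python
import math
from dataclasses import dataclass


@dataclass
class AntisampleStats:
    """Hypothetical summary of one numeric attribute over one antisample."""
    count: int = 0
    total: float = 0.0      # sum of the attribute values
    total_sq: float = 0.0   # sum of the squared attribute values

    def insert(self, value: float) -> None:
        """Fold a newly added record into the stored statistics."""
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def delete(self, value: float) -> None:
        """Remove a record; assumes it really was a member of the set."""
        self.count -= 1
        self.total -= value
        self.total_sq -= value * value

    def mean(self) -> float:
        return self.total / self.count

    def std_dev(self) -> float:
        """Population standard deviation from the stored moments."""
        variance = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))


stats = AntisampleStats()
for v in (2.0, 4.0, 6.0):
    stats.insert(v)
stats.delete(6.0)   # a record leaves the set
stats.insert(9.0)   # a new record joins it
```

Not every statistic can be maintained this way. A stored maximum, for instance, cannot be repaired from the summary alone when the record holding it is deleted, although the old value remains a valid upper bound.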


5.2. Experiments

We have conducted a number of experiments comparing accuracy of antisampling
with simple random sampling, using randomly generated queries on two
databases, as reported in chapter 6 of [15]. When the same amount of space
was allocated to antisampling and sampling (that is, the size of the database abstract
was the same as the size of the sample) we found estimation performance (the
closeness of estimates to actual values) very similar in most cases, and better for
antisampling the rest of the time. This can be attributed to the duality of sampling
and antisampling methods. Both exploit low-redundancy encodings of typically
high-redundancy databases, so we expect their information content and suitability for
estimation to be similar. The occasional better performance of antisampling seems due
to bounds.

We have also conducted more specific experiments with the set intersection bounds of
section 4.1 [16], and the transformation mean bounds of section 4.3 [15]. None of the
three sets of experiments measured computation time because the test databases were too
small, but we expect that this will be the major advantage of antisampling, as we now
discuss.

5.3. Simple random sampling and paging

We are currently seeing two important tendencies in statistical analysis of data sets on
computers [11]: a shift from large multiuser computers to small personal computers,
and a continued increase in the size of data sets analyzed as success has been achieved
with smaller data sets. Both make it increasingly difficult for analysis, or even
calculation of a mean, to be carried out in main memory of a computer, and secondary
storage issues are increasingly important. This is significant because secondary
storage like magnetic disks and optical disks, and tertiary storage like magnetic tape,
is organized differently from main memory: it is broken up into "pages" or "blocks"
that must be handled as a unit. This is not likely to change very soon, as it follows
from the physical limitations of secondary and tertiary storage. Since transfer of
pages from a secondary storage device to a central processor takes orders of
magnitude (typically, a factor of 1000) more time than the operations of that processor or
transfers within main memory, paging cost is the only cost of significance in statistical
analysis of large data sets.


This has important implications for sampling methods because they are much less
efficient when data is kept in secondary storage than in main memory. Consider simple
random sampling without replacement. We can use Yao's standard formula [36] to
estimate the number of pages that need to be retrieved to obtain k sample items,
assuming items are randomly distributed across pages, in just the same way the
formula is used for any set randomly distributed across pages. Let p be the number of
items on each database page, and let n be the number of items in the entire database.
Then the formula is:


$\frac{n}{p}\left[1 - \prod_{i=1}^{k} \frac{n - p - i + 1}{n - i + 1}\right]$

We have tabulated approximations to this function for some example values in Figure
5, using the formula of [35], which is much easier to evaluate while having a maximum
error for this range of values (reading off the tables in that paper) of less than 0.1%.
We assumed a million-record database. We used two values for page size: p=100,
which suggests a record-oriented database with perhaps ten attributes per record, and
p=1000, which suggests the transposed file organization common with statistical
databases. As may be observed, the number of pages retrieved, essentially the access
cost for data in secondary storage, is close to the size of the sample for small samples.
It only becomes significantly less when the sample size approaches the total number of
pages in the database, in which case the number of pages retrieved approaches the
number of database pages, a situation in which sampling is useless. So simple random
sampling is going to be approximately p times less page-efficient than a full retrieval of
the entire database, which means 100 or 1000 times worse for our example values of p.
The obvious question is thus: why not just calculate the statistic on the database and
not bother with the inexact answer provided by sampling?
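
The behavior described above can be checked directly. The following is a minimal sketch that evaluates Yao's exact product form rather than the approximation used for the tabulation. The function name is our own, and we assume that n is a multiple of p so that every page is full:

```python
def yao_pages(n: int, p: int, k: int) -> float:
    """Expected number of pages touched when k of n items are drawn at
    random without replacement and every page holds p items."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    pages = n / p
    # Probability that one particular page holds none of the k items.
    miss = 1.0
    for i in range(1, k + 1):
        factor = (n - p - i + 1) / (n - i + 1)
        if factor <= 0.0:
            # More than n - p items drawn: every page must be hit.
            miss = 0.0
            break
        miss *= factor
    return pages * (1.0 - miss)


if __name__ == "__main__":
    n = 1_000_000  # the million-record database assumed in the text
    for p in (100, 1000):
        for k in (10, 100, 1000, 10_000):
            print(f"p={p:5d}  k={k:6d}  pages={yao_pages(n, p, k):10.1f}")
```

For small k the result stays just below k, which is the observation that each sampled item costs nearly one page fetch.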

But antisampling does not share this great paging inefficiency. Assuming all statistics
of attributes of each antisample are stored together on the same physical page (a
requirement usually easy to fulfill, since there are not many useful statistics one can
give for a set as a whole), only one page need be retrieved for each antisample used.
Usually this is a small number. If we choose a good group of antisamples, we can
specify many populations users ask about in terms of set operations (intersection,
union, and complement) on the sets covered by the antisamples, or at worst proper
subsets of those antisamples. For instance, if we want to know a mean for the
American tankers in the Mediterranean, and we have antisamples for every major
nationality, major ship type, and region of the oceans, we need only retrieve three
pages: the page with statistics about American ships, the page with statistics about
tankers, and the page with statistics of ships in the Mediterranean. In general, if it is
possible to express a population P in terms of K antisamples, we need only retrieve K
pages, independent of the size of P, the sizes of the antisamples, or the size of the
database. So as the database increases, the relative advantage of antisampling to
sampling increases.
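
To make the K-page argument concrete, here is a small sketch of one thing that can be done with the K retrieved pages: bounding the size of an intersection from the stored counts alone. It uses the standard inclusion-exclusion bounds on set intersection. The function name and the ship counts are invented for illustration and do not come from the experiments reported here:

```python
def intersection_count_bounds(counts, database_size):
    """Bounds on the size of A1 ∩ ... ∩ AK, given only the sizes of the K
    sets and the size of the database that contains them all."""
    k = len(counts)
    if k == 0:
        raise ValueError("need at least one antisample count")
    # The intersection can be no larger than its smallest member set.
    upper = min(counts)
    # Each set excludes at most (database_size - count) items.
    lower = max(0, sum(counts) - (k - 1) * database_size)
    return lower, upper


# Hypothetical counts read from three antisample pages:
# American ships, tankers, and ships in the Mediterranean.
bounds = intersection_count_bounds([2500, 1500, 900], database_size=10_000)
```

Only the three stored counts and the database size are consulted, so the cost is three page fetches however large the database grows.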


5.4. Further difficulties with simple random sampling

Three additional problems complicate the use of simple random sampling relative to
antisampling. First, it is usually desirable that sampling be without replacement, and
additional algorithms and data structures are needed to ensure this [35].

Second, we have so far ignored the effort to locate members of a data
population on pages in the first place, which can add further paging costs. If we have
no index or hash table, we simply must examine each page in turn, throwing out the
ones that have no population members, and this increases the number of page fetches.
For small populations, this means a high wastage that can easily be
greater than the size of the sample. So it seems desirable to access a population
through an index or hash table whenever possible. But an index may be too big to
reside in main memory, and have paging costs itself. Usually database indexes link
together items having the same value for one particular attribute at a time, so if a
data population P of interest is specified by a number of restrictions on a number of
different attributes, many pages of index may have to be retrieved, followed by a
lengthy intersection operation of the set of all pointers to data items. Hashing can
more easily avoid extra paging, but usually allows access on only one attribute or
combination of attributes, which means it does not improve performance much in


Third, many statistical databases are not stored by record or "case" but in the
transposed form ([M], section 4-1-4), where only values for one attribute for
all items (or some small subset of the total set of attributes) are stored on a
page. This is an efficient form of storage for calculation of counts and means on a
single attribute because there are more values of that attribute per page. But it
does not help sampling because the only sampling ratios that justify sampling,
by the above arguments, tend to be very small, much less than the reciprocal of
the number of items per page. Increasing the number of items per page by
transposition can only increase this by a small factor in most cases (at best the ratio of
the size of a full record to the size of an attribute), which will often still result in only
one item fetched per page. Transposition also slows all queries involving several
attributes.


The disadvantages of simple random sampling are clear, and it may be wondered
whether other kinds of sampling could be more competitive with antisampling.
Much research has gone into devising a wealth of sampling techniques, but these
techniques seem to have other disadvantages.

... that is, putting data items onto ... one could take all the items on ...
pages, and save in paging [10]. ... is no good: the actual data item ... than it
sounds. A policy has to be ...

... is so antithetical to ...


If, having drawn some sample S, we decide we need to look at a superpopulation of the
original population P, or a sibling population of P (a population with a common
superpopulation with P), we must go back to the database and resample all over again
to obtain a new sample to analyze. We cannot use any part of our old sample in the
superpopulation sample because clearly it is biased in relation to the new sample. On
the other hand, if we chose an original data population P that was too large (even
though S is a comfortable size) and decide we want to focus in on some subpopulation
of it, merely censoring out items in S that do not belong to the subpopulation may
give too small a set for statistical analysis, particularly if the new subpopulation is
quite a bit smaller than the old. In other words, sampling is "brittle": the results of
one sampling are difficult to extend to a related sampling.

But antisampling extends gracefully to related populations. Adding another
restriction to restrictions defining a set is usually straightforward, and can never
worsen bounds obtained without the restriction, and the parts of the previous
analysis can be reused. Similarly, removing a restriction introduces no new problems
since analysis of the new population was a subproblem studied in reasoning about the
original population. This accommodation of related user queries by antisampling is
because much statistical analysis focuses on meaningful sets, not random sets, and
antisamples are sets.

5.7. Rejoinder 3: stratified and multistage sampling

Given the disadvantages of randomizing the physical placement of items in a database,
we might take the opposite course and place items on pages in systematic ways. To
sample we could use the same techniques people use in sampling a real world where
data items cluster in different ways [1,5]. For instance, if pages represent time periods,
we could do a two-stage sampling where we choose first random periods represented
by random pages, and then random items within those pages. Or in a population
census database, if pages represent particular pairs of geographical location and
occupation, we could do a stratified sampling within carefully chosen
geographical-occupational combinations.
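
The two-stage scheme just described can be sketched as follows. This is an illustration only: the function name, its parameters, and the representation of a database as a list of pages (each a list of items) are our own assumptions, and the sketch says nothing about whether the resulting sample is statistically valid for a given database.

```python
import random


def two_stage_sample(pages, num_pages, items_per_page, rng=None):
    """Stage 1: choose random pages. Stage 2: choose random items on each.

    `pages` is a list of pages, each a list of items. Returns the indexes
    of the pages fetched and the items sampled from them.
    """
    rng = rng or random.Random()
    chosen = rng.sample(range(len(pages)), num_pages)
    sample = []
    for index in chosen:
        page = pages[index]                   # one page fetch
        take = min(items_per_page, len(page))
        sample.extend(rng.sample(page, take))
    return chosen, sample


# Hypothetical database: 20 pages (time periods), 50 items on each.
database = [[(period, item) for item in range(50)] for period in range(20)]
fetched, items = two_stage_sample(database, num_pages=4, items_per_page=5,
                                  rng=random.Random(0))
```

Only `num_pages` pages are fetched regardless of the sample size, which is the paging saving that motivates the design. The price is that the estimate is only as good as the assumption that the chosen pages are representative.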

But  there  are  many  problems  with  using  such  sampling  paradigms: 

1. They are not for amateurs. Much knowledge about the nature of the data is necessary to
use them properly (perhaps only an expert statistician should use them), and even then
models of the data must be reconfirmed carefully. This can mean extensive prior statistical
study of related data, as in the first example above where we must be sure that the times
chosen are truly random, or in the second example where the geographical-occupational
combinations must be valid strata.

2. It is hard to quantify our certainty that proper conditions pertain, and it is therefore
difficult to put standard deviations on the estimates obtained by these samples.

3. If the data change with time their correlational properties may also change. Changes can
cause problems with pages overflowing or becoming too sparse, requiring awkward
immediate rearrangements of the partitioning scheme.

4. We can only cluster (group semantically related items together) along one dimension at a
time. For instance, if we group bills by date, we cannot simultaneously group them by
geographical location. This is awkward because a good partitioning for stratified sampling
to study one attribute is not necessarily a good partitioning for another attribute; database
stratification is permanent, unlike survey design stratification. And grouping records by
"hybrid" criteria based on different dimensions of the data is hard to analyze.

5. Complex sampling paradigms are limited to certain statistics. For instance, stratified
sampling only works well with "additive" statistics such as count and sum that can be
totalled over disjoint partitions to get a cumulative statistic, as well as certain ratio
statistics.

6. Complex sampling paradigms may require additional page access. In order to find the
right pages for stratified sampling or multistage clustered sampling, one needs "metadata"
[9] describing the data and its storage, and the size of this often requires it be in secondary
storage. Metadata is useful for many other purposes such as access method selection and
integrity maintenance, so there can be a good deal of it for a database. It also makes sense
to keep it with indexes to the data, if any, and these may have to be kept in secondary
storage anyway.

7. The comparison of stratified and multistage sampling to simple antisampling is unfair
because there are more sophisticated kinds of antisampling that correspond to the more
sophisticated kinds of sampling. For instance, "stratified antisampling" can be done when
we partition a population into disjoint pieces as with stratified sampling but then use
antisampling techniques to make the estimate on each piece, combining the results as with
stratified sampling. See Figure 6. If the pieces are chosen to minimize intrapiece variation,
the result can be better than that for simple antisampling. Sometimes stratified sampling
will be better than stratified antisampling, and sometimes vice versa, in the same way that
sampling compares with antisampling depending on how well the nature of the database is
captured in the database abstract.
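
The combining step mentioned in points 5 and 7 can be sketched as follows, under assumptions of our own: the pieces are disjoint, the count of each piece is known exactly, and antisampling has produced lower and upper bounds on the mean of each piece. The function name and the numbers are hypothetical. Because count and sum are additive over disjoint pieces, bounds on the overall mean are the count-weighted averages of the per-piece bounds:

```python
def combine_mean_bounds(pieces):
    """Combine per-piece mean bounds into bounds on the overall mean.

    `pieces` is an iterable of (count, mean_lower, mean_upper) triples,
    one per disjoint piece of the population.
    """
    total_count = 0
    sum_lower = 0.0
    sum_upper = 0.0
    for count, mean_lower, mean_upper in pieces:
        total_count += count
        sum_lower += count * mean_lower   # smallest possible piece sum
        sum_upper += count * mean_upper   # largest possible piece sum
    if total_count == 0:
        raise ValueError("the population is empty")
    return sum_lower / total_count, sum_upper / total_count


# Two hypothetical pieces: 100 items with mean in [10, 12],
# and 300 items with mean in [20, 21].
overall = combine_mean_bounds([(100, 10.0, 12.0), (300, 20.0, 21.0)])
```

The tighter the bounds within each piece, the tighter the combined bounds, which is why pieces with little internal variation are preferable.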

In summary, difficult administrative issues in both statistical analysis and database
design are raised by these more complicated sampling designs, and people with the
necessary expertise are scarce. It may be that for this reason alone they will not be used,
because if one cannot be sure one is getting a random sample then all the conclusions
one draws from that sample are suspect.


5.8. Rejoinder 4: special-purpose hardware

So far we have assumed conventional hardware. With unconventional hardware, statistical
calculations can be faster, but this does not necessarily make sampling any more attractive.

For example, we can use logic-per-track secondary storage devices (surveyed in [It]).
We can put hardware in the head(s) of a disk so that it calculates the sum of all items
on a track within one revolution of the track, or calculates the sum of selected items
by marking the items on one revolution and summing them on the next. The idea can
work for any moment statistic, or maximum and minimum, but other order and
frequency statistics are not additive in this sense and do not appear to be computable
this way. So we can speed calculation of some statistics, perhaps additionally with
parallelism in read operations on different disks or tracks, if we can afford a
special-purpose "moment-calculating" disk, which is likely to be expensive because of the
limited demand. But such a device would speed calculation of the exact statistic on
the data too, hastening construction of a database abstract. Construction might be
very efficient because it can be done by a single pass through all tracks of the disks in
a disk-based database, an intensive utilization of each track.

Similarly, multiple disks or multi-head disks could enable faster statistical calculations
since operations could be done on the devices in parallel. But this doesn't make
the paging problem go away; it just makes paging faster. And it makes database
abstract construction simultaneously faster.

But there is one hardware development that will improve the position of sampling
relative to antisampling: larger main memories that can hold larger databases.
Antisampling can still be performed in this situation (and can be thought of as a form
of caching), but the paging advantage disappears. Other advantages do not
disappear, however. And database sizes are increasing too.


We are developing a new technique for estimating statistics, primarily statistics on
databases. This "antisampling" is not just another sampling method, but something
fundamentally different, and subject to quite different advantages and disadvantages
than sampling. We have presented some of them. One disadvantage not yet mentioned
is the number of details that remain to be worked out. Considering the great effort
over the years in the perfection of sampling techniques, much more work is clearly
needed to make antisampling techniques a routine part of a statistical analysis arsenal.

Figure 4: stratified antisampling